FPGA-Based AI Smart NICs for Scalable Distributed AI Training Systems
Abstract
Training state-of-the-art artificial intelligence (AI) models requires scaling to many compute nodes and relies heavily on collective communication operations, such as all-reduce, to exchange the weight gradients between nodes. The overhead of these operations can bottleneck training performance as the number of nodes increases. In this paper, we first characterize the all-reduce operation overhead. Then, we propose a new smart network interface card (NIC) for distributed AI training using field-programmable gate arrays (FPGAs) to accelerate all-reduce operations and optimize network bandwidth utilization via data compression. The smart NIC frees up the system's compute resources to perform the more compute-intensive tensor operations and increases overall node-to-node communication efficiency. We build a prototype 6-node system and show that our proposed FPGA-based smart NIC enhances training performance by 1.6×, with an estimated 2.5× improvement at 32 nodes.
Similar resources
Efficient Techniques for Distributed Implementation of Search-Based AI Systems
We study the problem of exploiting parallelism from search-based AI systems on distributed machines. We propose stack-splitting, a technique for implementing or-parallelism, which when coupled with appropriate scheduling strategies leads to: (i) reduced communication during distributed execution; and, (ii) distribution of larger grain-sized work to processors. The modified technique can also be i...
Distributed Control for AI
This paper discusses a number of elementary problems in distributed computing and a couple of well-known algorithmic "building blocks", which are used as procedures in distributed applications. We shall not strive for completeness, as an enumeration of the many known distributed algorithms would be pointless and endless. We do not even try to touch all relevant sub-areas and problems studied in...
Metadatabase Meets Distributed AI
Heterogeneous Distributed Database Management Systems (HDDBMS) involve the interoperability of data sources. One approach to achieve this type of integration is to build interfaces between the different databases being integrated. This approach holds, for a particular case, at a specific point in time. In this case however, the database structures need to be adapted. Such adaptation is not advisa...
AI-based Trading in Open Distributed Environments
An open distributed environment can be perceived as a service market where services are freely offered and requested. Any infrastructure which seeks to provide appropriate mechanisms for such an environment has to include mediator functionality (i.e. a trader) that matches service requests and service offers. Commonly, the matching process is based upon some IDL-based service type definition, and ...
Journal
Journal title: IEEE Computer Architecture Letters
Year: 2022
ISSN: 2473-2575, 1556-6056, 1556-6064
DOI: https://doi.org/10.1109/lca.2022.3189207